4 Analysis Question 2
4.1 Restatement of Problem
In this analysis, we are asked to build the most predictive model for the sale prices of homes in all
neighborhoods of Ames, Iowa, using the train and test datasets retrieved from the Kaggle website.
Our strategy is to produce four candidate models: one from forward selection, one from backward
elimination, one from stepwise selection, and one custom-built model. For each of the four models, we
evaluate the adjusted R-Squared, the internal CV PRESS, and the Kaggle score.
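As a rough illustration of two of these criteria, the sketch below computes the adjusted R-Squared and a leave-one-out PRESS for a one-predictor least-squares fit. It is written in Python using only the standard library and is not the code used for this report; the data are invented, the function names are hypothetical, and the leave-one-out form of PRESS is an assumption, since the software used for the report may compute its CV PRESS with a k-fold scheme instead.

```python
# Sketch: adjusted R-squared and leave-one-out PRESS for a one-predictor
# least-squares fit. The data are invented; they are not Ames values.

def fit_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def adjusted_r2(x, y, p=1):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), p = number of predictors."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    my = sum(y) / n
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    r2 = 1 - sse / sst
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def press(x, y):
    """PRESS: sum of squared errors when each point is predicted by a fit that excludes it."""
    total = 0.0
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_rest, y_rest)
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
print("adjusted R^2:", adjusted_r2(x, y))
print("PRESS:", press(x, y))
```

A higher adjusted R-Squared and a lower PRESS both point to a better model, but PRESS is the more direct proxy for out-of-sample error because each observation is predicted by a fit that never saw it.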
4.2 Modeling
After much discussion within our team, we decided to test an initial model with all the available
variables (data features) in the Ames, Iowa housing dataset, which are documented in a text file
named datadescription.txt retrieved from the Kaggle website. In this dataset, 5.88% of the values are
missing, recorded as Not Available (NA), and there are some special cases that are
discussed below. All of these need to be either removed or adjusted before
our analysis [see Section 5.6].
Working through the list of variables in the initial model and checking the model's R-Squared
value as we went, we cleaned the data in the following sequence of steps:
● NA values in numerical variables are replaced with zeros.
● NA values in non-numerical (character) variables are replaced with “None” to
indicate that the feature is absent for that observation.
● Character variables are converted into factor variables, which improved our predictions
because R can then treat each variable as a set of levels measured against one reference level.
● One influential observation, data point 524, was detected with a standout Cook’s D and high
leverage. We ultimately decided to remove this observation since it may distort our predictions [see
Appendix 5.7].
● The variables Id, MSSubClass, MSZoning, Utilities, Condition2, Exterior1st, Exterior2nd,
MasVnrType, KitchenQual, Functional, SaleType and PoolQC contain levels in the testing dataset
that do not exist in the training dataset. Because these are treated as categorical variables, every
level that appears at prediction time must also have been seen during training. Since we cannot
train on levels that are absent from the training dataset, these variables are excluded from the model.
● The variables OverallQual and OverallCond are initially read in as numeric variables because
their values look numeric. These two variables proved to be more useful for our
analysis once they were converted into factor variables, which better represent the data.
● SalePrice is log-transformed before being used as the response for our models because its
original distribution is heavily right-skewed [see Appendix 5.8].
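Three of the cleaning steps above (NA replacement, excluding categorical variables with unseen test levels, and the log transform) can be sketched as follows. This is an illustrative Python sketch using only the standard library, not the code used for this report; the toy rows are invented, and the helper names and the LogSalePrice column name are hypothetical.

```python
import math

# Toy rows standing in for the Kaggle train and test files (values are invented).
train = [
    {"LotFrontage": 65.0, "Alley": None, "MSZoning": "RL", "SalePrice": 208500},
    {"LotFrontage": None, "Alley": "Grvl", "MSZoning": "RM", "SalePrice": 181500},
]
test = [
    {"LotFrontage": 80.0, "Alley": None, "MSZoning": None},
]

def fill_missing(rows, numeric_cols):
    """Replace missing numeric values with 0 and missing text values with "None"."""
    cleaned = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            if value is None:
                new_row[col] = 0 if col in numeric_cols else "None"
            else:
                new_row[col] = value
        cleaned.append(new_row)
    return cleaned

def unseen_level_columns(train_rows, test_rows, categorical_cols):
    """Return the categorical columns whose test rows contain levels absent from training."""
    flagged = []
    for col in categorical_cols:
        train_levels = {r[col] for r in train_rows}
        test_levels = {r[col] for r in test_rows}
        if test_levels - train_levels:
            flagged.append(col)
    return flagged

numeric_cols = {"LotFrontage", "SalePrice"}
categorical_cols = ["Alley", "MSZoning"]

train_clean = fill_missing(train, numeric_cols)
test_clean = fill_missing(test, numeric_cols)

# Columns that cannot be used because the test data has levels the model never saw.
drop = unseen_level_columns(train_clean, test_clean, categorical_cols)

for row in train_clean:
    # Log-transform the response to reduce right skew.
    row["LogSalePrice"] = math.log(row["SalePrice"])
    for col in drop:
        del row[col]

print("dropped:", drop)
```

In this toy example, MSZoning is flagged because the missing test value becomes the level "None", which never occurs in the training rows. Predictions made on the log scale need to be exponentiated to return to dollar amounts.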